Improving the Performance of K-means Clustering Algorithm
Abstract
This paper proposes two updating methods to improve the clustering performance of adaptive k-means clustering. The proposed updating methods are suitable for both off-line and on-line clustering. The capability of the updating methods is compared with that of existing updating methods using simulated and real data sets. Simulation results show that the proposed updating methods significantly improve the overall performance of the RBF network. This paper also investigates some properties of the adaptation method for the on-line adaptive k-means clustering algorithm.

file:///C|/Documents%20and%20Settings/Ponn/Desktop/ijcim/past_editions/1998V06N2/improve_3.htm (1 of 19)24/8/2549 9:23:51 IMPROVING THE PERFORMANCE OF K-MEANS CLUSTERING ALGORITHM

Introduction

The centre locations influence the performance of radial basis function (RBF) networks. Poggio and Girosi [14] used all the training data as centres in their regularisation network, which is based on the RBF network. However, this may lead to network overfitting as the number of data becomes too large. To overcome this problem, Poggio and Girosi [14] proposed a network with a finite number of centres. They also showed that a gradient descent approach used to update the RBF centres actually moved the centres towards the majority of the data, suggesting that a clustering algorithm may be used to position the centres.

K-means clustering is the most widely used clustering algorithm to position the RBF centres. Its simplicity and its ability to perform on-line clustering may explain this choice. However, the k-means clustering algorithm can be sensitive to the initial centres, and the search for the optimum centre locations may end in poor local minima. Many attempts have been made to minimise these problems [5], [6], [9], [11], [16]. In this paper, two updating rules are suggested as alternatives or improvements to the standard adaptive k-means clustering algorithm.
The updating methods are proposed to give better overall RBF network performance rather than good clustering performance. However, there is a strong correlation between good clustering and the performance of the RBF network. The sensitivity of the RBF network to the centre locations is also studied.

K-means Clustering Problems

The k-means clustering algorithm works on the assumption that the initial centres are provided. The search for the final clusters or centres starts from these initial centres. Without proper initialisation, the algorithm may generate a set of poor final centres, and this problem can become serious if the data are clustered using an on-line k-means clustering algorithm. In general, three basic problems normally arise during clustering: dead centres, local minima and centre redundancy.

Dead centres are centres that have no members or associated data. These centres are normally located between two active centres or outside the data range. The problem may arise from bad initial centres, possibly because the centres have been initialised too far away from the data. Therefore, it is a good idea to select the initial centres randomly from the training data or to set them to random values within the data range. However, this does not guarantee that all the centres are equally active. Some centres may have too many members and be updated frequently during the clustering process, whereas others may have only a few members and are hardly ever updated.

The centres in an RBF network should be selected to minimise the total distance between the data and the centres so that the centres can properly represent the data. A simple and widely used squared error cost function can be employed to measure the distance, defined as:

$$E = \sum_{j=1}^{n_c} \sum_{i=1}^{N} \| v_i - c_j \|^2 \qquad (1)$$

where $N$ and $n_c$ are the number of data and the number of centres respectively, and $v_i$ is the data sample belonging to centre $c_j$.
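The squared error cost and the dead-centre condition described above can be illustrated with a short sketch. It is only a minimal example in plain Python, not code from the paper: the function names (`nearest_centre`, `squared_error_cost`, `membership_counts`) and the toy data are assumptions made for illustration, and each sample is assumed to belong to its nearest centre under the Euclidean distance.

```python
import math


def nearest_centre(sample, centres):
    """Return the index of the centre closest to `sample` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda j: math.dist(sample, centres[j]))


def squared_error_cost(data, centres):
    """Sum of squared Euclidean distances between each sample and its nearest centre."""
    return sum(math.dist(v, centres[nearest_centre(v, centres)]) ** 2 for v in data)


def membership_counts(data, centres):
    """Number of samples assigned to each centre; a count of 0 marks a dead centre."""
    counts = [0] * len(centres)
    for v in data:
        counts[nearest_centre(v, centres)] += 1
    return counts


# Hypothetical toy data: two tight groups, plus one centre placed far outside the data range.
data = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centres = [(0.0, 0.5), (4.0, 0.5), (10.0, 10.0)]

cost = squared_error_cost(data, centres)
counts = membership_counts(data, centres)
dead = [j for j, n in enumerate(counts) if n == 0]  # indices of centres with no members
```

In this toy setting the third centre never wins a sample, which is the kind of situation the text attributes to initialising a centre too far from the data.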
In equation (1), $\|\cdot\|$ is taken to be the Euclidean norm, although other distance measures can also be used. During the clustering process, the centres are adjusted according to a certain set of rules such that the total distance in equation (1) is minimised. However, in the process of searching for the global minimum, the centres frequently become trapped at local minima. Poor local minima may be avoided by using algorithms such as simulated annealing, stochastic gradient descent and genetic algorithms. However, these techniques normally involve heavy computation and are not suitable for on-line clustering. In the present study, the improvements are based on adaptive k-means clustering, which does not require heavy computation.

To give good modelling performance, the RBF network should have sufficient centres to represent the identified data. However, as the number of centres increases, the tendency for centres to be located at the same position or very close to each other also increases. There is no point in adding extra centres if the additional centres are located very close to centres that already exist. However, this is a normal phenomenon in k-means clustering and the unconstrained steepest descent algorithm as the number of centres becomes sufficiently large [4].

K-means Clustering Algorithm

There are two basic versions of k-means clustering: a non-adaptive version introduced by Lloyd [12] and an adaptive version introduced by MacQueen [13]. The most commonly used version is adaptive k-means clustering based on the Euclidean distance [5]. Adaptive k-means clustering can be considered a special case of the gradient descent algorithm in which only the winning cluster is adjusted at each learning step.
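The winner-takes-all adaptation described above can be sketched as follows. This is a minimal illustration in plain Python of standard adaptive k-means, not the updating methods proposed in the paper. The function name `adaptive_kmeans`, the sample stream and the initial centres are hypothetical, and the adaptation rate is assumed to be the reciprocal of the number of samples assigned to the winning centre so far, the choice attributed to MacQueen [13].

```python
import math


def adaptive_kmeans(stream, initial_centres):
    """On-line k-means: for each incoming sample, move only the nearest (winning) centre."""
    centres = [list(c) for c in initial_centres]
    counts = [0] * len(centres)  # samples assigned to each centre so far
    for v in stream:
        # Find the winning centre z, i.e. the one nearest to the sample.
        z = min(range(len(centres)), key=lambda j: math.dist(v, centres[j]))
        counts[z] += 1
        # Assumed adaptation rate 1 / n_z; with this choice the first assigned
        # sample overwrites the initial centre, and the centre then tracks the
        # running mean of the samples assigned to it.
        rate = 1.0 / counts[z]
        centres[z] = [c + rate * (x - c) for c, x in zip(centres[z], v)]
    return centres, counts


# Hypothetical data stream presented one sample at a time.
stream = [(0.0, 0.0), (10.0, 10.0), (2.0, 0.0), (10.0, 12.0)]
centres, counts = adaptive_kmeans(stream, [(1.0, 1.0), (9.0, 9.0)])
```

Because only one centre moves per sample, the cost per step is a single nearest-centre search plus one vector update, which is what makes the scheme usable on-line. Swapping in a constant rate only requires changing the line that computes `rate`.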
This paper concentrates only on adaptive k-means clustering, as the algorithm can be used for on-line training of the RBF network. Adaptive k-means clustering tries to minimise the cost function in equation (1) by searching for the centres $c_j$ on-line as the data are presented. As each data sample is presented, the Euclidean distances between the data sample and all the centres are calculated, and the nearest centre is updated according to:

$$c_z(t) = c_z(t-1) + \eta(t)\,[v(t) - c_z(t-1)] \qquad (2)$$

where $z$ indicates the nearest centre to the data sample $v(t)$. Note that the centres and the data are written in terms of time $t$, where $c_z(t-1)$ represents the centre location at the previous clustering step. The adaptation rate, $\eta(t)$, can be selected in a number of ways. MacQueen [13] set $\eta(t) = 1/n_z(t)$, where $n_z(t)$ is the number of data samples that have been assigned to the centre up to time $t$. Darken and Moody [5] used a constant adaptation rate and a square root method. Another method, called search-then-converge, was introduced by Darken and Moody [6]. According to this method, $\eta(t)$ is updated using:
Publication date: 2006